Levenshtein Distances Fail to Identify Language Relationships Accurately

نویسنده

  • Simon J. Greenhill
چکیده

The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into genealogical subgroups. In this article I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from a large database of Austronesian languages. Comparing the classification proposed by the Levenshtein distance to that of the comparative method shows that the Levenshtein classification is correct only 40% of the time. Standardizing the orthography increases the performance, but only to a maximum of 65% accuracy within language subgroups. The accuracy of the Levenshtein classification decreases rapidly with phylogenetic distance, failing to discriminate homology and chance similarity across distantly related languages. This poor performance suggests the need for more linguistically nuanced methods for automated language classification tasks.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Perceptive evaluation of Levenshtein dialect distance measurements using Norwegian dialect data

The Levenshtein dialect distance method has proven to be a successful method for measuring phonetic distances between Dutch dialects. The aim of the present investigation is to validate the Levenshtein dialect distance with perceptual data from a language area other than the Dutch, namely Norway. We calculate the correlation between the Levenshtein distances and the distances between 15 Norwegi...

متن کامل

Perceptive evaluation of Levenshtein dialect distance measurements using Norwegian dialect data

The Levenshtein dialect distance method has proven to be a successful method for measuring phonetic distances between Dutch dialects. The aim of the present investigation is to validate the Levenshtein dialect distance with perceptual data from a language area other than the Dutch, namely Norway. We calculate the correlation between the Levenshtein distances and the distances between 15 Norwegi...

متن کامل

Linguistic distance as a determinant of the mutual intelligibility between Netherlandic and Belgian Dutch language varieties

Research on the mutual intelligibility of closely related Germanic languages has shown that several linguistic and extra-linguistic factors determine intelligibility scores to a high degree. In this paper, we aim to pinpoint the precise role of the determinant phonetic distance. As for example Gooskens (2007) shows, aggregate Levenshtein distances turn out to be good predictors of the intelligi...

متن کامل

Lexical evolution rates by automated stability measure

Phylogenetic trees can be reconstructed from the matrix which contains the distances between all pairs of languages in a family. Recently, we proposed a new method which uses normalized Levenshtein distances among words with same meaning and averages on all the items of a given list. Decisions about the number of items in the input lists for language comparison have been debated since the begin...

متن کامل

Measuring Norwegian dialect distances using acoustic features

Computational dialectometry has been proven to be useful for finding dialect relationships and identifying dialect areas. The first to develop a method of measuring dialect distances was Jean Séguy, assisted and inspired by Henri Guiter (Chambers and Trudgill, 1998). Strongly related to the methodology of Séguy is the work of Goebl, although the basis of Goebl’s work was developed mainly in dep...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Computational Linguistics

دوره 37  شماره 

صفحات  -

تاریخ انتشار 2011